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Abstract 

Non-gaussian component analysis (NGCA) introduced in [51] offered 
a method for high dimensional data analysis allowing for identifying 
a low-dimensional non-Gaussian component of the whole distribution 
in an iterative and structure adaptive way. An important step of the 
NGCA procedure is identification of the non-Gaussian subspace using 
Principle Component Analysis (PCA) method. This article proposes a 
new approach to NGCA called sparse NGCA which replaces the PCA- 
based procedure with a new the algorithm wc refer to as convex projection. 
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1 Introduction 

Numerous mathematical applications in econometrics or biology are confronted 
with high dimensional data. Such data sets present new challenges in data 
analysis, since often the data have dimensionality ranging from hundreds to 
hundreds of thousands. This means an exponential increase of the computa- 
tional burden for many methods. On the other hand the sparsity of the data 
in high dimensions entails that data thin out in the local neighborhood of a 
given point x . Hence statistical methods are not reliable in high dimensions if 
the sample size remains of the same order. This problem is usually referred to 
as "curse of dimensionality" (cf. [8], [22] )■ The standard approach to deal with 
the high dimensional data is to introduce a structural assumption which allows 
to reduce the complexity or intrinsic dimension of the data without significant 
loss of statistical information [19], |17j . 

Let a random phenomenon is observed in the high dimensional space 
while the intrinsic dimension of this phenomenon is much smaller, say m . 
From a geometrical point of view m is the dimension of a linear subspace 
that approximately contains the structure of the sample data. Alternatively 
we can consider this structure as a low dimensional signal embedded in high 
dimensional noise. Consequently a lower dimensional, compact representation 
that according to some criterion, captures the interesting information in the 
original data, is sought. In this paper we assume that we have a sample of data 
lying approximately in a m < d dimensional linear (target) subspace I C 
of M*^ . In order to reduce the problem dimension one looks for a mapping from 
the original data space onto this subspace. 

In the statistical literature the Gaussian components of the data distri- 
bution are often considered as entropy maximizing and consequently as 
non- informative noise [5]. It is well known that for high-dimensional clouds 
of points most low-dimensional projections are approximately Gaussian 
The Non-Gaussian Component Analysis (NGCA), introduced in [23], is based 
on the assumption that the structure of the data is represented by a low 
dimensional non-Gaussian component of the observation distribution, as 
opposed to a full dimensional Gaussian component, considered as noise. Thus 
the objective of NGCA is to "kill the noise" rather than to describe the whole 
multidimensional distribution. Note that the suggested way of treating the 
Gaussian distribution as a pure nuisance in general exclude the use of the 
classical Principle Component Analysis (PCA) which simply searches for the 
directions with of largest variance. 

In the same way as a number of projection methods of feature extraction (e.g. 
Projection Pursuit [10], Partial Least Square Regression [281 [29], Conditional 
Minimum Average Variance Estimation [3^ or Sliced Inverse Regression 
[15\ m [2]), when implementing the NGCA we decompose the problem of 
dimension reduction into two tasks: the first one is to extract from data a set 
of vectors which are close to the target space T; the second is to construct 
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a basis of the target space from these vectors. These characteristics can also 
be found in the unsupervised, data driven approach of SNGCA, presented in 
this article. When compared to available dimension reduction methods (e.g. 
Principal Component Analysis [13], Independent Component Analysis [TT] 
or Singular Spectrum Analysis [7]) SNGCA does not assume any a priori 
knowledge about the density of the original data. 

The proposed method, as well as NGCA, is an iterative algorithm which 
is structure adaptive in the sense that every new step essentially uses the 
result of previous iterations. The main difference between NGCA and SNGCA 
algorithms lies in the way the information is extracted from the data. The 
algorithm of NGCA heavily relies upon the Euclidean projection and the 
PCA of the set of the estimated vectors. In the case when data dimension is 
important and the sample size is moderate, computation of the I2 -projection 
can amplify the noise. Moreover, when most of the estimated vectors do not 
contain information about the space I but are mainly noise, the results of 
using the PCA algorithm to extract the basis of feature space can be very poor. 
The reason for that is that the PCA algorithm is known to accumulate the 
noise. To address this issue the SNGCA uses convex programming techniques 
to estimate elements of the target subspace by "convex projection", what 
allows to bound uniformly the estimation error. Further, another technique 
of convex analysis, based on computation of rounding ellipsoids of the set of 
estimated vectors, is used to extract the subspace information. These changes 
allow the SNGCA algorithm to treat large families of candidate vectors with- 
out increasing significantly the variance of the estimation of the target subspace. 

The paper is organized as follows. First we describe the considered set-up in 
Section [2] and discuss the main ideas behind the proposed approach. The formal 
description of the algorithm is given in Section [3l A simulation study of the al- 
gorithms is presented in Section [H where we compare the performance obtained 
by SNGCA algorithms and by several other methods of feature extraction. 

2 Non- Gaussian Component Analysis 
2.1 The setup 

The following setting is due to j2l]. Let Xi, ...,Xn be i.i.d. from a distribution 
IP in M'^ . We suppose that IP possesses a density p with respect to the 
Lebesgue measure on M.'^ , which can be decomposed as follows: 

p{x) = (l)^=o^^{x)q{Tx). (2.1) 

Here 4'^,j: stands for the density of the multivariate normal distribution 
AA(/U, S) with parameters p ^ (expectation) and S G W^^'^ positive 
definite (covariance matrix). The function q : R"^ M with m < d has to be 
nonlinear and smooth. T S j^mxc' an unknown linear mapping. Naturally, 
we refer to X = range T as target or non- Gaussian subspace. For the sake of 
simplicity let us assume = where iE'[X] stands for the expectation of 
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2.2 Estimation of elements of the target subspace 



X. 



Though the representation ()2.ip is not uniquely defined, the subspace T C M'^ 
is well defined as well as the Euclidean projector 11* on X . By analogy with the 
regression case [U [Ml [E] , we could also call I the effective dimension reduc- 
tion space (EDR-space). We call m effective dimension of the data. In many 
applications m is unknown and has to be recovered from the data. Our task 
is to recover 11* . The model structure (j2.ip allows the following interpreta- 
tion (cf. [21]) : we can decompose the random vector X into two independent 
components 

X = U*X + {I -U*)X = Z + u, 

where Z is a non- Gaussian ?ti -dimensional signal and u is [d — m)- 
dimensional normal noise. 



As we have already noticed in the introduction, SNGCA algorithm relies upon 
two basic operations: the first is to construct a set of vectors, say /3i,...,/3j, 
which are "close" to the target subspace; the objective of the second is to 
compute an estimate H of the Euclidean projector 11* on X using the set 



2.2 Estimation of elements of the target subspace 

Estimation of elements of I. The implementation of the first step of 
SNGCA is based on the following result (cf. Theorem 1 of [24 



Theorem 1. Let X follow the distribution with the density p which satisfies 
i2. 1\) and let 1E[X] = . Suppose that a function ijj : M is continuously 

differentiable. Define 



/3{^p) := lE[Vip{X)] = / V^p{x)p{x)dx 



(2.2) 



where Vip stands for the gradient of ip . Then there exists a vector (3 £ I such 
that 



< 



iT.-^JEiXipiX)]] 



xtl:{x)p{x) dx 



In particular, if ]E[X^{X)] =0, then I3{ip) G I . 
The bound of Theorem [T] implies that 



\\{i-n*)mh< 



xiIj{x)p{x) dx 



(2.3) 



where / is the (i -dimensional identity matrix and 11* is the orthogonal 
projector on I . 



2 Non-Gaussian Component Analysis 



5 



Based on this result, [23] suggested the following way of constructing a set of 
vectors (3 which approximate the target space I. Let hi,...,hL be smooth 
bounded functions on R'^ . Define 7/ = lE[Xhi{X)] and ??/ = lE[Vhi{X)]. 
These vectors are not computable because they rely on the unknown data dis- 
tribution, but they can be well estimated from the given data. Next, for any 
vector c E , define the vectors /3(c), 7(c) G M"^ with 

L L 

1=1 1=1 

Then by Theorem [H /3(c) £ I conditioned that 7(c) = 0. Indeed, if we set 
■tp{x) = J2i cihi{x) , then iE[X^(X)] = , and by (lOD . 

7(c) = lE[Vij{X)] G J. 

The approach of [23] is to compute the vectors of coefficients c G which 
ensure 7(c) ~ and then to use the corresponding empirical analogs of /3(c) 
to estimate the target space. More precisely, given the observations Xi, ...,Xn 
compute the set of vectors (empirical counterparts of r]i and ji ) according to 

N N 

ii = N-^Y.^MXr), m = N-'Y,^hi{Xi). (2.4) 

i=l i=l 

Similarly define for c G 

L L 

Pic) = X = X ^^^^ ■ 

1=1 1=1 

One can expect that for vectors c with 7(c) = 0, the vectors /3(c) are "close" 
to I. 

Below we follow a similar way of constructing /3(c) with an additional 
constraint that the considered vectors of coefficients c satisfy ||c||i < 1. This 
constraint allows for both efficient numerical algorithms and sharp error bounds. 

The test functions hi can be generated as follows: let be a unit ball Bd = 
{x G R'^l : ||x||2 < 1} and let f{x,u;) , / : i3 x M*^ ^ R be a continuously 
differentiable function. Consider the functions hi{x) = f{x,u!i) , for some uji G 
B, I = 1,...,L. The choice of family f{-,uj) is an important parameter of 
the algorithm design. For instance, in the simulation examples of Section |3] we 
consider the following families: 

f{x,uj) = tanh(a;^x)e-"l'^l'2/2, (2.5) 
f{x,uj) = [l + (L^^x)2]-iexp^^^-"li^li2/2, (2.6) 

and LOi, I = 1, L are unit vectors in R'^ . The next result justifies the proposed 
construction. 
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2.2 Estimation of elements of the target subspace 



Theorem 2. Suppose that f is continuously differentiahle in w and for some 
fixed constant f* and any lo B^, x S M*^ 



Var 



Yav[Xjf{X,u;)] < fl 
d 

f{X,u) 



dxi 



< fl, Gov 



Coy[XjV^ f{X,u;)] < f^I, 
d 



< fll, 



Consider the (random) set 
e = {c e 



|c||i < 1, 7(c) =0). 



(2.7) 



Then for any e > there is a set A dVt of probability at least 1 — e such that 
on A for all c G C , 



where 



(I - n*)^(c)||2 < Vd5N{l + \\^~%), 



iV-V2 inf {5no/rA + 2A-i[ed + log(2d/e)]} 

A<A*AfV2 



and Cd = 4(i log 2 . 

The proof of the theorem is given in the appendix. 

Due to this resuh, any vector c E C can be used to produce a vector /3(c) 
which is close to the target subspace X. However, such constructed vectors are 
only informative if its length is significant relative to the estimation error. 

We therefore compute a family of such coefficient vectors c by solving the 
following optimization problems: for a fixed unit vector ^ G M*^ called a probe 
vector, find 



argmin ||'? — ?7(c)||2, subject to 7(c) 

:||c||i<l 



0. 



(2.^ 



where 77(c) = cfii . This is a convex optimization problem which can be effi- 
ciently solved by some numerical procedures, e.g. by the interior point method. 
Then we set 



P = P(c) = Y,ciVi 
I 



(2.9) 



It can be easily seen that for ^ _L T , the solution /3 fulfills /3 ~ . On 
the contrary, if ^ G X, then there is a solution with significantly positive 
||c||i and ||/3(c)||2. This leads to the following strategy: In the first step of 
the algorithm when there is no information about I available, the probe 
vectors s-i's generated randomly from Bd ■ In the next 

steps we apply the idea of structural adaptation by generating the essential 
part of the vectors from the estimated subspace I . For details see Section O 
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We address now the implementation of the second step of SNGCA - inferring 
the projector 11* on I from estimations of elements of X. 

Recovering the target subspace. Suppose that we are given vectors 
Pi, Pj which satisfy 

\\Pj-PA2<Q, 

for some Pj £ 2 , j = 1, J . The problem of estimating the subspace X from 
Pj is a special case of the so called Reduced Rank Regression (RRR) problem. A 
simple and popular PCA estimate of the projector IT* on X is given by solving 
the quadratic optimization problem 

J 

n = argmin^ 11(1 -n^)^,- 111, 

where the minimum is taken over all projectors of rank m . One can easily 
verify that 11 projects on the subspace in M*^ generated by the first in 
principal eigenvectors of the matrix PjPj ■ However, if the number of 

informative vectors Pj is small with respect to J , the quality of estimate 
n can be extremely poor. To address this drawback of the PCA solution we 
consider a sparse estimate of X which uses rounding ellipsoids for the set 

For a symmetric positive-definite matrix B and r > 0, the ellipsoid £r{B) is 
defined as 

£,.(S) = {x G M"' I x^Bx < r^}, 
For a < 1 ,£{B) = 8,i{B) is a -rounding ellipsoid for a convex set S if 

£i/^iB)<Z§<Z£{B). 

Note that such ellipsoid exists with a = d"^/^ due to the Fritz John theorem 
[12j . Furthermore, numerically efficient algorithms for computing \/d -rounding 
ellipsoids are available, see e.g. [I8]. So, for recovering the spatial information 
from the vector system {±Pj}j^^ one can look for the d^/^ rounding ellipsoid 

for the convex hull S of points {±Pj}j^i . 

We measure the quality of estimation of the subspace X by the closeness of the 
estimated projector 11 to 11* : 

e(x,x) = ||n-n*||^ = Tr[(n-n*)2]. (2.10) 

The property of the spatial information recovery, based on the idea of rounding 
ellipsoids, is described in the following theorem. 

Theorem 3. 1. Let § be the convex envelope of the set {ib/3j}, j = 1, J , and 
let £i(i?) he an ellipsoid inscribed into S , such that t^(B) is \fd -rounding 
ellipsoid for S . Then for any unit vector f _L X , 
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2. 2 Estimation of elements of the target subspace 



2. If there is jj, & M'^ with > and J2j f^j — 1 such that 

where Am(^) stands for the m -th principal eigenvalue of A, then 

3. Moreover, let 11 = T^r^ where is the matrix of m principal eigenvec- 
tors of . Then 



in-n*ii? < 



4:Q^dy/d 



'2 - A* - 2^2- 
The proof of the theorem is presented in the appendix. 



The results of Theorems [2] and [3] provide a kind of theoretical justification 
for the algorithms, presented in the next section. Indeed, suppose that the 
test functions hi,..., hi and the vectors are chosen in such a way 

that there are at least m vectors with "significant" projection on I among 
/?!,... as in l\2.9h . Then the projector estimate IT, computed using the 
ellipsoid S{B) which is rounding for the set {±/3j} , with high probability will 
be close to 11* . 



However, the results about the estimation quality depend critically on the 
dimension d . Numerical results also indicate that with growing dimension, 
the fraction of non-informative vectors Pj increases leading to the situation 
when some of the longest semi-major axis of £^ are also non-informative and 
nearly orthogonal to X. This enforces us to introduce an additional check of 
non-normality for the directions suggested by the estimated ellipsoid £ . 

Identifying the non-Gaussian subspace by statistical tests: Currently 
the estimation procedure of the vectors P{iph,c) itself does not allow the 
identification of the semi-axis within the target space. Hence the basic idea 
is to apply statistical tests on normality w.r.t. the significance level a to the 
original data from M*^ projected on every semi-axis of £^ . If the hypothesis 
of normality is rejected w.r.t. the projected data, the corresponding semi-axis 
is used as a basis vector for the reduced target space T . 

Structural adaptation: At the beginning of the algorithm, we have no prior 
information about I and therefore sample the directions (^j and u>i randomly 
from the uniform law. However, the SNGCA procedure assumes that the ob- 
tained estimated structure 2 delivers some information about I which can 
be used for improving the sample mechanism and therefore, the final quality 
of estimation. This leads to the structurally adaptation iterative procedure [S]: 
the step of estimating the vectors and the step of estimating subspace 
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I are iterated, the estimated structural information given by I is used to im- 
prove the quahty of estimating the vectors f3j in the next iteration of SNGCA. 
In our implementation, we sample a fraction of directions and due to 
the previously estimated ellipsoid B and the other part randomly. However the 
number of the randomly selected directions remains constant during iteration. 
In the next section we present the formal description of SNGCA. 

3 Algorithms 

This section describes the principal steps of the procedure. The detailed de- 
scription is given in the Appendix. 

3.1 Normalization 

As a preprocessing step the SNGCA procedure uses a componentwise normal- 
ization of the data. Let a = (o"!, . . . cr^) be the standard deviations of the data 
components of xi, . . . ,Xd ■ For i = 1, . . . ,N the componentwise normalization 
of the data is done by 5^ = diag(o"~^)Xj . 

3.2 Estimation of the vectors from non-Gaussian subspace: 

Let {iOji} , I = 1,...,L, and , j = 1,...,J be two collections of unit 
vectors called the measurement directions. Define for all j = 1, . . . , J and 
I < L, the functions hji{x) = f{x,ujji) , and compute the vectors jji and rjji 
due to p.4p . Next, for every j < J , compute the vector cj by solving the 
problem ()2.8p with ^ = leading to the vector (3j by (12. 9p . 

3.3 Computing the estimator IT of the projector U* 

The projector 11 is constructed on the base of the first m principal eigenvec- 
tors of the rounding ellipsoid £ for the set 8 spanned by the vectors ±Pj , 
j = 1, . . . , J . To build the ellipsoid £ we use the algorithm in [TH] which in 
fact computes the minimum volume ellipsoid (MVEE) which covers 8 . For 
convenience we provide the algorithm in the appendix. 

3.4 Building the subspace S using statistical tests 

In order to construct the projector 11 the identification of the m principal 
eigenvectors of £ that approximate I is required. In projecting the data onto 
the semi-axis of £ and testing the projected data on normality the projective 
approach from the estimation step is repeated. 

Since statistical tests specialized for a certain deviation from the normal distri- 
bution, are more powerful, we use different tests inside of SNGCA in order to 
cope with different deviations from normality of the projected data. To be more 
precise we use the -test according to D'Agostino-Pearson [31] to identify a 
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3.5 Structural Adaptation 



significant asymmetry in the projected distribution and the EDF-test accord- 
ing to Anderson-Darhng [T] with the modification of Stephens |25] . which is 
sensitive to the tails of the projected distribution. In order to confirm these 
test results from above we use the Shapiro- Wilks test [22] based on a regression 
strategy in the version given by Royston [20^ [2T] . Once we have classified the 
semi-axis of £^ as being close to the target space we can use the identified 
subset of axis in the structural adaptation step. 

3.5 Structural Adaptation 

The first step of the algorithm assumes that the measurement directions ujji 
and are drawn randomly from the unit sphere in . At each further step of 
the algorithm we can use the result of the previous iterations of SNGCA in order 
to accumulate information about X in a sequence Xi , T2 , . . . of estimators of the 
target space. This information is used to draw a fraction of the measurement 
directions from the estimated subspaces and the other part of such direction is 
selected randomly. The procedure is described in detail in algorithm [H 

3.6 The stopping criterion 

Suppose that X is a priori given. Then the convergence of SNGCA can be mea- 
sured according to the criterion ()2.10p . More precisely we assume convergence 
if the improvement of the error measured by (j2.10p from one iteration to the 
next one is less than 5 percent of the error in the former iteration. To this 
end the maximum angle between the subspaces specified by the matrix of 



eigenvectors V'^^'> = \v'^^\v^\ . . and = \^^-^^^\v^'^^\ . . given by 




is computed. In the next section we demonstrate the improvement of the esti- 
mation error between subsequent iterations of SNGCA. 

4 Numerical results 

The aim of this section is to compare SNGCA with other statistical methods 
of dimension reduction. The reported results from Projection Pursuit (PP) and 
NGCA were already published in [23]. 

4.1 Synthetic Data 

Each of the following test data sets includes 1000 samples in 10 dimension 
and each sample consists of 8 -dimensional independent, standard and homo- 
geneous Gaussian distributions. The other 2 components of each sample are 
non-Gaussian with variance unity. The densities of the non-Gaussian compo- 
nents are chosen as follows: 



cos{9) = max 




(A) Gaussian mixture: 2 -dimensional independent Gaussian mixtures with 
density of each component given by 0.5 0_3^i(x) -|- 0.5 03,i(x) . 
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(B) Dependent super-Gaussian: 2 -dimensional isotropic distribution with 
density proportional to exp(— ||x||) . 

(C) Dependent sub-Gaussian: 2 -dimensional isotropic uniform with constant 
positive density for ||a;||2 < 1 and otherwise. 

(D) Dependent super- and sub-Gaussian: 1 -dimensional Laplacian with den- 
sity proportional to exp(— I^Lapl) and 1 -dimensional dependent uniform 
U{c,c + 1) , where c = for \xLap\ < log(2) and c = — 1 otherwise. 

(E) Dependent sub-Gaussian: 2 -dimensional isotropic Cauchy distribution 
with density proportional to A(A^ — x'^)~^ where A = 1 . 

That means, that the non-normal distributed data are located in a linear 
subspace. 

In the sequel we compare SNGCA with PP and NGCA using the test data 
sets from above and the estimation error defined in (j2.10p . Each simulation 
is repeated 100 times. All simulations are done with the hyperbolic tangent 
index as in (12.50 . Since the speed of convergence varies with the type of non- 
Gaussian components we use the maximum number maxlter = 31og(d) of 
allowed iterations to stop SNGCA. In the experiments the error measure €{I,I) 
is used only to determine the final estimation error. All simulations other than 
whose w.r.t. model (C) are computed with a componentwise pre- normalization. 




(C) (D) 




(E) 

Figure 4.1: densities of the non-Gaussian components: (A) 2 d independent Gaussian mixtures, 
(B) 2d isotropic super-Gaussian, (C) 2d isotropic uniform and (D) dependent Id Laplacian 
with additive 1 d uniform, (E) 2 d isotropic sub-Gaussian 
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4-1 Synthetic Data 



Figure BTT] illustrates the densities of the non-Gaussian components of the test 
data. For all numerical experiments reported in this article the dimension of 
the target space T is a priori given as a tuning parameter for the algorithm. 

Since the optimizer used in PP tends to trap in a local minima in each of the 
100 simulations, PP is 10 times restarted with random starting points. The 
best result w.r.t. ()2.10p is reported as the result of each PP-simulation. In all 
PP-simulations the number of non-Gaussian dimensions is a priori given. In 
the next figure 14.21 we present boxplots of the error ()2.10p obtained from the 
methods PP, NGCA and SNGCA. 
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(D) 
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(E) 

Figure 4.2: performance comparison in 10 dimensions of PP and NGCA versus SNGCA (wrt. 
the error criterion £{I,I) ) using the index tanh{x) . The doted line denotes the mean, the 
solid lines the variance of (|2.10|) . 
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Concerning the resuhs of SNGCA on the data sets (A) and (D) we observe a 
shghtly inferior performance compared to NGCA. In case of model (A) this 
is due to the fact that most of the data projections have almost a Gaussian 
density. Consequently the decrease of the estimation error is slow with 
increasing number of iterations. In case of the model (D) the higher variance 
of the results indicate that the initial sampling of the data sets gives a poor 
result. Consequently more iterations are needed to get an estimation error that 
is comparable to the result of NGCA. In order to illustrate this interpretation 
we report in table ()4.ip the progress of SNGCA w.r.t. estimation error e{I,I) 
in each iteration for every test model. 
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0.1371e-6 
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0.0450e-3 


0.0031e-6 
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0.0416e-3 


0.0033e-6 
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0.0360e-3 


0.0014e-6 








5 


0.0287e-3 


0.0024e-6 









(E) 



Table 4.1: Progress of SNGCA for test models in 10 dimensions with increasing number j 
of iterations. The empirical mean of the error £(X,X) defined in (|2.10p is denoted by jit and 
CTj is its empirical variance. 

Illustration of one-step-improvement: We shall now illustrate the iterative 
gain of information about the EDR space. To this end we use the projection of 
(ij to the EDR-space in order to demonstrate, how the algorithm works. Figure 
l4.3l shows that dist(/3,I) decreases with increasing number of iterations. We ob- 
serve, that estimators f3 with higher norm tend to be close to I . Nevertheless, 
this can not be assured for much higher dimensions. Moreover the improvement 
in each iteration depends on the size of the sampling of measurement directions. 
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4-1 Synthetic Data 



1-th iteration 



D D.l 2 0.3 D.4 D.5 

in 

3-th iteration 



1 

D.e 

g D.6 
I 0.4 
0.2 
B 



r 



a 0.1 02 0.3 04 0.5 

IPI 

5-th iteration 



O.E 
g 6 
S 0.4 

0.2 



102 3 04 5 



2-th iteration 




0.1 0.2 03 4 05 
4-th iteration 




0.1 0.2 03 4 05 

II ^11 
6-th iteration 



1 2 03 4 5 
II ^11 



Figure 4.3; illustrative plots of SNGCA applied to toy 20 dimensional data of type (C) (see 
section|4]): We show ||/3|| vs. cos(^(/3,X)) for different iterations of the algorithm where X is 
the a priori known EDR-space. 




Figure 4.4: results wrt. £{T,T) with deviations of Gaussian components following a geomet- 
rical progression on [10"'^, 10"^] where r is the parameter on the abscissa) . 
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Now let us switch to the question of robustness of the estimation procedure with 
respect to a bad conditioning of the covariance matrix S of the data. In figure 
14.41 we consider the same test data sets as above. The non-Gaussian coordinates 
always have unity variance, but the standard deviation of the 8 Gaussian 
dimensions now follows the geometrical progression 10-^ l0-^+2r/7^ . . . , 10'' 
where r = 1, . . . , 8 . Again we apply a componentwise normalization procedure 
to the data from the models (A), (B), (D), (E). We observe that the condition 
of the covariance matrix heavily influences the estimation error for the methods 
NGCA and PP(tanh). In comparison SNGCA is independent of differences 
in the noise variance along different directions in most cases. Only the detec- 
tion of the uniform distribution by SNGCA is influenced by the condition of S . 




Figure 14.51 compares the behavior of SNGCA with PP and NGCA as the 
number of standard and homogeneous Gaussian dimensions increases. As 
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4-1 Synthetic Data 



described above we use the test models with 2 -dimensional non-Gaussian 
components with unity variance. We plot the mean of errors e(X,X) over 100 
simulations w.r.t. the test models (A) to (E). 

Again concerning the mean of errors e(X,X) over 100 simulations of PP 
and NGCA we find a transition in the error criterion to a failure mode for 
the test models (A), (C) between d = 30 and d = 40 and between d = 20 
and d = 30 respectively. For the test models (B),(D) and (E) we found a 
relative continuous increase in e(X,X) for the methods PP and NGCA. In 
comparison SNGCA fails to analyze test model (A) independently from the size 
of the sampling, if the dimension exceeds d = 12 . Concerning test model (B) 
there is a sharp transition in the simulation result between d = 35 and d = 40 . 

Failure modes: In order to provide a better insight into the details of the 
failure modes we present box plots of e(X, X) in the transition phases w.r.t. 
the models (A) and (B). 



model (A) 



0.66 



0.33 




0.66 



0.33 



15 17 
dimensions 
model (B) 




37 39 
dimensions 



Figure 4.6: failure modes of SNGCA - upper figure: model (A) - lower figure: model(B) 



Figure 14.61 demonstrates the differences in the transition phases of model (A) 
and (B) respectively. The transition phase is characterized by a high variance 
of the estimation error. For model (A) the increase of the variance cr^ of 
e(X,X) beginning at dimensions 13 and its decrease beginning at dimension 
15 indicates that a sharp transition phase happens in the interval [13, 15] . 
For higher dimensions more iterations of SNGCA have a decreasing effect on 
the estimation result. This indicates that by the sampling of the measurement 
directions, we can not detect the non-Gaussian components of the data density. 
For model (B) the transition phase starts at dimension 35 and ends at 
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dimension 43 . 

Moreover the decrease of towards higher dimensions and the increase of 
the mean of e{I,I) is much slower. This indicates that the non-Gaussian den- 
sity components might be detectable if we would allow much more iterations 
of SNGCA and an enlarged size of the set of measurement directions. This ob- 
servation motivates the interpretation that the Monte-Carlo sampling of the 
measurement directions is a very poor strategy that fails to provide sufficient 
information about the Laplace distribution in high dimensions. Currently the 
SNGCA performance is limited by the sampling strategy. 

4.2 Application to real life examples 

We consider a simulating of a mixture of oil and gas flowing under high 
pressure through a pipeline. Under these physical conditions different phases 
of the oil-gas-mixture may exist at the same time in the phase space T . Only 
some of these phase configurations in F are stable over long periods of time. 
Consequently one expects some clusters of points in T indicating the physical 
state of the mixture. The 12 -dimensional data set, obtained by numerical 
simulations of a stationary physical model, was already used before for testing 
techniques of dimension reduction [3j. The data set comes with a subset of 
training data and a subset of test data. The length of the time series is 1000 
in each dimension. 

The task with this data is to find the clusters representing the stable config- 
urations in the training data set. It is not known a priori if some dimensions 
are more relevant than others. However it is known a priori that the data is 
divided into 3 classes, indicated by different shapes of the data points. The 
cluster information is not used in finding the EDR-space. Again we compare 
SNGCA with NGCA and PP using the hyperbohc tangent index ([23]) . For PP 
and NGCA the results are shown in figure B77l They were already published in 




Figure 4.7: left: 2D projection of the "oil flow" data manually chosen from 3D projection 
obtained from by vanilla FastICA methods using the tanh index - right: projection obtained 
by NGCA using a combination of Fourier, tanh, Gauss-pow3 indices 
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Figure 14.71 shows a slice through T such that the structure in the data set 
becomes visible: Using NGCA we can distinguish 10 — 11 clusters versus at 
most 5 for PP with index ()2.5p . 

For the SNGCA method the results are shown in the figure SNGCA iden- 
tifies 3 non-Gaussian dimensions. All figures are rotated by hand such that the 
separation of the cluster is illustrated at best. The next figure ITSl shows the 
result of the oil-fiow data obtained from SNGCA using a combination of the 
indices ()2.5p and ()2.6p . In this case we can distinguish 10 — 11 clusters versus 





-0.5, 




X 



Figure 4.8: phase configurations of the "oil flow" data with apriori cluster mapping induced by 
crosses, circles and triangles obtained by SNGCA using a combination of asymmetric-Gauss 
and the tanh index 

at most 5 for PP. Moreover we confirm the result of NGCA on the data set. 
The clusters are clearly separated from each other on the SNGCA projection. 
Only on the PP projection they are partially confounded in one single cluster. 
By applying the projection obtained from SNGCA to the test data, we found 
the cluster structure to be relevant. We conclude that SNGCA gives a more 
relevant estimation of I than PP. However it is found that the family of func- 
tions h^{x) is an important tuning parameter in SNGCA: If we use only the 
tanh-index, we found only 6-7 cluster are identified and they are partially con- 
founded. Hence a combination should be used in order to cope with symmetric 
data distributions. 

5 Conclusion 

We propose a new improved methodology for the non-Gaussian component anal- 
ysis, as proposed in [21]. As well as NGCA the suggested method is based on a 
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semi-parametric framework for separation an uninteresting multivariate Gaus- 
sian noise subspace from a linear subspace, where the data are non-Gaussian 
distributed. Both methods assume that the non-Gaussian contribution to the 
data density contains the structure in a given data set. The combined strat- 
egy of convex projection and structural adaptation provides promising results 
of SNGCA. Moreover SNGCA provides an estimate for the dimension of the 
non-Gaussian subspace. On the other hand, the quality and the numerical com- 
plexity of Monte-Carlo sampling of the measurement directions is the main 
limitation of the proposed technique. 

A Statistical tests 

In this section we shortly report the statistical tests on normality used the 
dimension reduction step of SNGCA. 

In order to detect a significant asymmetry in the distribution of the original 
data projected on the semi-axis of the numerical approximation of the rounding 
ellipsoid £^ we use the if ^ -test according to D'Agostino-Pearson [31]. The 
D'Agostino-Pearson test computes how far the empirical skewness and kurtosis 
of the given data distribution differs from the value expected with a Gaussian 
distribution. The test statistic is approximately distributed according to the 
X2 -distribution and its empirical data counterpart is given by 

1 ^ 

i=l 
1 ^ 

i=l 

Here /i denotes the empirical mean, a the empirical standard deviation of 
the data and Z{-) denotes a normalizing transformations of skewness and 
kurtosis. The test is more powerful w.r.t. an asymmetry of a distribution. 

Furthermore we use the EDF-test according to Anderson-Darling [I] with the 
modification of Stephens [25]: Let -F/v be the empirical cumulative distribution 
function and F the assumed theoretical cumulative distribution function. The 
test statistics T measures the quadratic deviations between Fjy and F : 

T = [ [Fn{x) - F{x)fu{x) dF 
Jr 

where iy{x) is the weighting function u{x) = [F/v(x)(l — Fn{x))]~^ . In sum 
the data counterpart of T is given by 

f = -cN -cEti N-\2i - l)[\og{F{a-\X, - ^)) 

+ log(l-F(a-i(Xjv_m-/^))] 

where c = 1 + 0.75A^~^ + 2.25A^~^ . Again /_f is the empirical mean and s the 
empirical standard deviation of the data. We compute T to detect deviations 
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from normality in the tails of the projected distributions. The test is rejected if 
T exceeds a critical value cv specific for a given level of significance: 



a : 


0.10 


0.05 


0.025 


0.01 


0.005 


cv : 


0.631 


0.752 


0.873 


1.035 


1.159 



The last test, applied to the projected data is the Shapiro- Wilks test [22] based 
on a regression strategy in the version given by Royston [20^ [2T] : 

N/2 

b = QAr-i+l(^Af-i+l — Xj) 

i=l 

[ai,...,aN) = Tv^i^v^i M/2 

In this test m = (mi, . . . , m^) denotes the expected values of standard normal 
order statistics for a sample of size N and S is the corresponding covariance 
matrix. 



B Proofs 

B . 1 Proof of Theorem H 

We use the following result from the empirical process theory (similar state- 
ments under slightly different assumptions can be found e.g. in [26]). Let B 
stand for the unit Euclidean ball, centered at the origin. Similarly, B{fj,,uj°) = 
{uj : — Li;°||2 < /^} is a ball of radius fi centered at uj° . For a function 
q{uj,x) , denote lEN[q{uJ,X)] = iV~^ Y.'Li li'^^^i) ■ 

Lemma 1. Let q{uj,x) be a continuously differentiable function of uj ^ and 
X G M"' such that for every lo £ Bd 

Var[(?(cu,X)] <(?*, Gov [V^g(u;, X)] < g*/, (B.12) 

with some q* ,q* > . Define 

C(u;) = N'/^JEMi^, X)] - ]E[q{u, X)] } 

and C{uj,lli') = ({lo) — C{ll>') ■ Then for any no > 1 , there is XI = A^(no) > 
such that for any lo° £ Bd , /i < 1 , and A < A*A^^/^ 

log^exp[AC(tJ°)] < nog*AV2, (B.13) 

rA 

log IE exp 

where td = X^fc^i ^ log(2''''') = Ad\og2 . Moreover, define 
3(A) =no(g72 + 2(/*)A2 + Crf. 

Then for any e > 



- sup C{uj,u:°)\ <2nQq*\^ + td, (B.14) 



ipf sup C(u;) >2A-i [3(A) + log £-1]) <£. 
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Proof. Define for uj G Bd 

go{X;u;) = logiEexp 



-.{q{LO,Xi)-lE[q{u;,X^)]} 



Tlien gQ(X;uj) is analytic in A and satisfies gQ{0;Lj) = gQ{0;uj) = 0. Moreover, 
the condition (IB.12P implies gQ{0;uj) < 1. Therefore, there is some > 
such that for any Ai < A^ and any unit vector lo , it holds gQ{Xi;Lo) < Af/2. 
Independence of the 's implies (iRTal) for A < XlN^^^{noq*)-'^^^ . In the 
same way, for ijj,u & define (^{io,X) = \7^q{io,Xi) — lE[Vajq{uj, Xi)] and 

g{X; uj, u) = log E exp [-^^=C(w, Xi)] . 

Then similarly to the above, the function g{X; to, u) is analytic in A and satisfies 
with some A| > , any Ai < A^^ and any unit vectors u and uj 



g{Xi-uj,u) < 2A 



1- 



The bound (|B.14p is derived from [23], Lemma 5.1. Independence of the Xi 's 
yields for A < AJiVi/2(nog*)-i/2 



log IE exp 



2A 



:'u^VC(w) } < 2A 



This means that the condition (£-D) of |2^ is verified and the result (|B.14p 
follows from [23], Lemma 5.1. Introduce a random set A = {(A/ 2) sup^ ((uj) > 
3(A) + loge~^} . and A'^ is its complement. By the Cauchy-Schwartz inequality 

]P{A') < iEexp|^supCM-3(A)-loge"^| 

< eIE^/^exp{XC{uj°) -noq*X^/2} 

X ^^/2exp{AsupC(t^,w°) - 2noq*X'^ - e^} < e 

and the last result follows. 

□ 

The result of Lemma [1] can be easily extended to the case of a vector function 
q{oj,x) e R'^: 

IP (sup IICH Hoc >2A-i [3(A) + log(d/e)]') <e. 

This fact can be obtained by applying Lemma [1] to each component of the vec- 
tor Ci'^) ■ The term log{d/e) is responsible for the overall deviation probability. 

Let now f{x,Lo) be a twice continuously difFerentiable function uj & Bd and 
X G M'^ such that for every j < d, lu ^ Bd, and x G R'^ , it holds 



Var [Xj fix, Lo)] < , Gov [X, V^/(X, lo)] < f^I, 



Var 



</;, Gov 



< f;i, 
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B. 2 Proof of Theorem 



Then for any no > 1 , there is A J = A^(no) > and for any e > , a random 
set A with 1P{A) > 1 — e such that on A it holds by Lemma [T] 

sup \\lEN[Xf{X,uj)] - lE[Xf{X,u)]\\^ < 5n, 

sup ||iE;;v[V./(^,^)] - iE[V,/(X,L^)]|| < 5^, 



where 



= N-^l^ inf {5no/r A + 2\-^ [u + log(2d/e)] }. 



A<A*Afi/2 

By construction of vectors 7/ and rji , it holds on A 

max ||7i - 7/II00 < 6n, max \\rji - r]i\\oo < 6n ■ 

1<1<L 1<1<L 
This implies for any ||c||i < 1 

117(c) - 7(c)||oo < II^(c) - ??(c)||oo < 5n- 
The constraint 7(c) = implies ||7(c)||oo < ^at , thus 

||7(c)||2 < ^d5M. 

and by ([O) 

||(/-n*)^(c)||2 

< ||(/ - n*){^(c) - 7?(c)}||2 + ||(/ - n*)7Kc)||2 

< ||^(c)-7?(?)||2+||S-S(c)||2 

B.2 Proof of Theorem H 

Let S stand for the convex envelope of {ib/3j}^^^ . As £i(i?) is inscribed in 8 , 
its support function ^£1(5) (2;) = max^efiiC-B) •^"''^ is majorated by that of 8 : 

Ui{B){v) < i${v) = max Iv"^/?,!, for any u G M''. 
j=i,...,j 

Next, the support function of the ellipsoid 8,i{B) is 

so that the condition — Pj\\2 < £» implies 

v~^B^^v < max jv'^Af < £»^ 
j=i,...,j 

for any v -LI . 
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Let us prove the second claim of the proposition. Let H* be a projector onto X . 
By the assumption of the proposition there exist coefficients with < 1 

such that 



S 



dcf 1 



>- 0. 



This imphes ()2.f ip . Now, for any such S and its pseudo-inverse , the ehip- 
soid, E{{S~^) with 

£{(5+) = {xel\ x'^S+x< 1} 



is inscribed into S. Indeed, the support function ^^f(^g+-^{x) = {x'^ Sx)^/"^ of 
this eUipsoid fulfills for x G Bd 



1/2 



/ \ 1/2 



< max \x Bj\ = £s(a;), 

i<j<J 



Now we are done: as the ellipsoid £{(S'+) is inscribed into S , it is contained 
in the concentric to £i(i?) ellipsoid E^{B) which covers S. 

To show the last statement of the theorem, observe that 

Tr[(n - ii*f] = 2{m - Tr[n*n]) = 2Tr[(/ - n*)n] . 

On the other hand, using the second claim one gets 



Tr[(/-n*)n] < {d-m)supv~^Ui 



< {d — m) sup 



< 



2d3/2g2 

A* - 2£-2 • 



C The algorithm 

Here we present the full algorithmic description of the SNGCA 
procedure. We start with the linear estimation subprocedure: 
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Algorithm 1: linear estimation of P{iph,c) 
Data: Y,L,J 
Result: 

Sampling: choice of measurement directions 
for j=l to J do 
for 1=1 to L do 

Compute: 

end 

Compute Cj as in ()2.8p and /3j = '^j'^i • 
end 



The following subprocedure reports the computation of 
the -v/d-rounding ellipsoid based on a proposal in 

Algorithm 2: Compute of the -rounding of the MVEE 

Data: {3j}i=i 
Result: B , 

Let 6^ = maxi<j<j {(3j,BiPj) and set Vi = 5^ . 
Let Bq be the inverse empirical covariance matrix of the 

and set U = Vi{5f - . 
Moreover let i be the loop index, 
repeat 

Xi — Bifi}^* 

Bi+i = (1 - ti)-^ [b, - ti{l + Ui)-^XixJ^ 

= (1 - ur^ (sf -u{i + i^.yHPk* , x^)'') 

until S'^* < C ■ d where C is a tuning parameter. 



The next algorithm [3] reports the pseudocode for constructing a reduced basis 
of the target space from the estimated elements by means of algorithm O 

Algorithm 3: Dimension Reduction 

Data: B 

Result: (^first m eigenvectors of B^ 

Let V be the matrix of eigenvectors Vi from B 

computed according to algorithm O 

for i=l to d do 

Project the data orthogonal on Vi . 

Compute tests on normality of the projected data, 
end 

Discard every eigenvector with associated normal 
distributed projected data. 



C The algorithm 
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In algorithm [T] we start with a random initiaUzation of the non-parametric 
estimator [3j by means of a Monte-Carlo sampling of the directions ujji and 
. However we can use the result of the first iteration A; = 1 of SNGCA in 
order to accumulate information about X in a sequence Xi,T2, . . . of estimators 
of the target space. The procedure is described in detail in algorithm HI 

Algorithm 4: structural adaptation of the linear estimation 
Data: (^first m eigenvectors of 

Let {vi}'^i denote the reduced set of eigenvectors from 

B and let k iterations be completed. To initialize iteration k + 1 

choose random numbers Zji, . . . , Zjm 

and uii, . . . ,uim. from and set 

■= 127=1 ^jsVis for 1 < j < ni < J 

■= Y^T=i '^is^is for 1 < / < n2 < L 
Then define uJL-n2i ■ ■ ■ and O-nu • • • )0 analogous to the case 
k = 1 . Now compose the sets 

1?1 ) • • • 1 ?ni ) ?ni+l' • • • ' ?J J 

r (k) (k) (k) (fc). 

For the initialization in the case k = k + 1 . Moreover we choose 

ni = kd and n2 = kd until rii > J — d or n2 > L — d . Otherwise set 

111 = J — d or n2 = L — d . 

Choice of parameters: One of the advantages of the algorithm proposed 
above is the fact that there are only a few tuning parameters. 

i) Suppose now that Ui is an absolute continuous random variable with 
u!i ~ • Without loss of generality we set e = (1,0, ... ,0) . Due to 
the normalization of (loi, . . . , tUd) , it holds: 

iP(|(wi,...,Wd)^e| >0.5) = (Vd)-^ 

However the choice of J and L heavily depends on the non-gaussian 
components. In the experiments we use 7d < J < 18d and 6d < L < 16d . 

ii) Set the parameter of the stopping rule to = 0.05 . 

iii) Set the constant in the stopping rule for the computation of the MVEE 
to C = 2. 

iv) Set the significance level of the statistical tests to a = 0.05 . 
Finally we give a description of the complete algorithm. 
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Algorithm 5: full procedure of SNGCA 



Data: {Xi}^^^ , L , J , a 
Result: I 

Normalization: The data (Aj)^-^ are recentered. Let 
a = ((7i, . . . ad) be the standard deviations of the 
components of Xi . Then Yi = diag(cr~^)Xj denotes the 
componentwise empirically normalized data. 

Main Procedure:; // loop on k 

while ~ StoppingCriterion{1 ,1) do 

Sampling: The components of the Monte- Carlo-paits 
of ^j*"^ and cOj^^ are randomly chosen from • 
The other part of the measurement directions are 
initialized according to the structural adaptation 



approach described in algorithm SI Then and 

(k) 

ujji are normalized to unit length. 

Linear Estimation Procedure: 
for j=l to J do 

for 1=1 to L do 



end 

Compute the coefficients {q}^^ by solving the 
second-order conic optimization problem l\2.8\i : 
min q s.t. 

Ezii(c^-cr)<i, o<c+,cr w 

Compute /3f ^ = ELii^l -c[)v^'^^ 
end 

Dimension Reduction: 

Compute the symmetric matrix B^^^ defining the 
approximation of £ according to algorithm [2j Reduce the basis of X 
according to algorithm [3l 

end 



Complexity: We restrict ourselves to the leading polynomial terms of the 
arithmetical complexity of corresponding computations counting only the mul- 
tiplications. 
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1. The numerical effort to compute r]ji and 7^7 in algorithm [T] heavily de- 
pends on the choice of h{uj^ x) . Let h{u}~^ x) = tanh(a;'''x) . Then this 
step takes 0{J{\ogN)'^ N"^) operations. 

2. Algorithm [2] takes 0{(fJ\og{J)) operations [18] . 

3. For the optimization step in [T] we use a commercial solvei0 based on an 
interior point method. The constrained convex projection solved as an 
SOCP takes 0{d?"rfi) operations there n is the number of constraints. 

4. Computation of the statistical tests in one dimension: Let N denote the 
number of samples. D'Agostino-Pearson-test needs 0{N^ log N) and the 
Anderson-Darling-test O {{log N)'^N'^) operations. The test of Shapiro- 
Wilks takes 0{N'^) . In order to avoid robustness problems [H] the num- 
ber of samples is limited to < 1000 . For larger data sets, N = 1000 
points are randomly chosen. 

Hence without tests I is computed in O {J {log N)'^N'^ + cfjlog{J) + cPn^) 
arithmetical operations per iteration. 
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